OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text
There is growing evidence that pretraining on high quality, carefully
thought-out tokens such as code or mathematics plays an important role in
improving the reasoning abilities of large language models. For example,
Minerva, a PaLM model finetuned on billions of tokens of mathematical documents
from arXiv and the web, reported dramatically improved performance on problems
that require quantitative reasoning. However, because all known open source web
datasets employ preprocessing that does not faithfully preserve mathematical
notation, the benefits of large-scale training on quantitative web documents are
unavailable to the research community. We introduce OpenWebMath, an open
dataset inspired by these works containing 14.7B tokens of mathematical
webpages from Common Crawl. We describe in detail our method for extracting
text and LaTeX content and removing boilerplate from HTML documents, as well as
our methods for quality filtering and deduplication. Additionally, we run
small-scale experiments by training 1.4B parameter language models on
OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass
the performance of models trained on over 20x the amount of general language
data. We hope that our dataset, openly released on the Hugging Face Hub, will
help spur advances in the reasoning abilities of large language models.
STEVE-1: A Generative Model for Text-to-Behavior in Minecraft
Constructing AI models that respond to text instructions is challenging,
especially for sequential decision-making tasks. This work introduces an
instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1,
demonstrating that the unCLIP approach, utilized in DALL-E 2, is also effective
for creating instruction-following sequential decision-making agents. STEVE-1
is trained in two steps: adapting the pretrained VPT model to follow commands
in MineCLIP's latent space, then training a prior to predict latent codes from
text. This allows us to finetune VPT through self-supervised behavioral cloning
and hindsight relabeling, bypassing the need for costly human text annotations.
By leveraging pretrained models like VPT and MineCLIP and employing best
practices from text-conditioned image generation, STEVE-1 costs just $60 to
train and can follow a wide range of short-horizon open-ended text and visual
instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction
following in Minecraft with low-level controls (mouse and keyboard) and raw
pixel inputs, far outperforming previous baselines. We provide experimental
evidence highlighting key factors for downstream performance, including
pretraining, classifier-free guidance, and data scaling. All resources,
including our model weights, training scripts, and evaluation tools, are made
available for further research.
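As a rough illustration of the classifier-free guidance mentioned above, the sketch below combines conditional and unconditional action logits with a guidance scale. It uses the generic formulation (unconditional plus scaled difference); the abstract does not confirm that STEVE-1 uses exactly this form, and the function name, the `scale` parameter, and the plain-list representation of logits are assumptions made for the example.

```python
from typing import List, Sequence


def guided_logits(
    cond: Sequence[float],
    uncond: Sequence[float],
    scale: float,
) -> List[float]:
    """Generic classifier-free guidance on a vector of logits.

    scale = 0 returns the unconditional logits, scale = 1 returns the
    conditional logits, and scale > 1 pushes further toward the condition.
    This is an illustrative formulation, not the paper's confirmed one.
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Hypothetical logits for two actions.
    print(guided_logits([2.0, 0.0], [1.0, 1.0], scale=3.0))
```

With a scale above 1, actions favored by the instruction-conditioned model are amplified relative to what the unconditioned model would do anyway.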
Large Language Models Are Human-Level Prompt Engineers
By conditioning on natural language instructions, large language models
(LLMs) have displayed impressive capabilities as general-purpose computers.
However, task performance depends significantly on the quality of the prompt
used to steer the model, and most effective prompts have been handcrafted by
humans. Inspired by classical program synthesis and the human approach to
prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic
instruction generation and selection. In our method, we treat the instruction
as the "program," optimized by searching over a pool of instruction candidates
proposed by an LLM in order to maximize a chosen score function. To evaluate
the quality of the selected instruction, we measure the zero-shot performance
of another LLM following the selected instruction. Experiments on 24 NLP tasks
show that our automatically generated instructions outperform the prior LLM
baseline by a large margin and achieve better or comparable performance to the
instructions generated by human annotators on 19/24 tasks. We conduct extensive
qualitative and quantitative analyses to explore the performance of APE. We
show that APE-engineered prompts can be applied to steer models toward
truthfulness and/or informativeness, as well as to improve few-shot learning
performance by simply prepending them to standard in-context learning prompts.
Please check out our webpage at
https://sites.google.com/view/automatic-prompt-engineer
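The selection step described above (score every candidate instruction, keep the best) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_model` is a hypothetical stand-in for an LLM, the candidate list is written by hand rather than proposed by a model, and execution accuracy on a few input-output pairs is only one possible score function.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical demonstration data for an "antonym" task.
ANTONYMS = {"hot": "cold", "up": "down", "fast": "slow"}
EXAMPLES = list(ANTONYMS.items())


def toy_model(instruction: str, word: str) -> str:
    """Stand-in for an LLM: it only reacts to two keywords."""
    text = instruction.lower()
    if "opposite" in text or "antonym" in text:
        return ANTONYMS.get(word, word)
    return word  # otherwise it just echoes the input


def execution_accuracy(instruction: str) -> float:
    """Fraction of examples the toy model answers correctly."""
    correct = sum(toy_model(instruction, x) == y for x, y in EXAMPLES)
    return correct / len(EXAMPLES)


def select_instruction(
    candidates: Sequence[str],
    score_fn: Callable[[str], float],
) -> Tuple[str, float]:
    """Return the highest-scoring candidate and its score."""
    if not candidates:
        raise ValueError("need at least one candidate instruction")
    best = max(candidates, key=score_fn)
    return best, score_fn(best)


if __name__ == "__main__":
    pool = [
        "Repeat the word.",
        "Write the opposite of the word.",
        "Translate the word.",
    ]
    print(select_instruction(pool, execution_accuracy))
```

In a real setting both the candidate pool and the score would come from LLM calls; the search itself stays this simple.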
You Can't Count on Luck: Why Decision Transformers Fail in Stochastic Environments
Recently, methods such as Decision Transformer that reduce reinforcement
learning to a prediction task and solve it via supervised learning (RvS) have
become popular due to their simplicity, robustness to hyperparameters, and
strong overall performance on offline RL tasks. However, simply conditioning a
probabilistic model on a desired return and taking the predicted action can
fail dramatically in stochastic environments since trajectories that result in
a return may have only achieved that return due to luck. In this work, we
describe the limitations of RvS approaches in stochastic environments and
propose a solution. Rather than simply conditioning on the return of a single
trajectory as is standard practice, our proposed method, ESPER, learns to
cluster trajectories and conditions on average cluster returns, which are
independent of environment stochasticity. Doing so allows ESPER to achieve
strong alignment between target return and expected performance in real
environments. We demonstrate this in several challenging stochastic offline-RL
tasks, including the puzzle game 2048 and Connect Four played
against a stochastic opponent. In all tested domains, ESPER achieves
significantly better alignment between the target return and achieved return
than simply conditioning on returns. ESPER also achieves higher maximum
performance than even the value-based baselines.
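The idea of conditioning on average cluster returns rather than single-trajectory returns can be illustrated with a small relabeling sketch. It assumes the cluster assignments are already given; ESPER learns them, and that learning step is not shown here. The function names and the toy numbers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def average_cluster_returns(
    returns: Sequence[float],
    cluster_ids: Sequence[int],
) -> Dict[int, float]:
    """Mean return of the trajectories assigned to each cluster."""
    if len(returns) != len(cluster_ids):
        raise ValueError("returns and cluster_ids must have the same length")
    totals: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for ret, cid in zip(returns, cluster_ids):
        totals[cid] += ret
        counts[cid] += 1
    return {cid: totals[cid] / counts[cid] for cid in totals}


def relabel_with_cluster_average(
    returns: Sequence[float],
    cluster_ids: Sequence[int],
) -> List[float]:
    """Replace each trajectory's return with its cluster's mean return."""
    means = average_cluster_returns(returns, cluster_ids)
    return [means[cid] for cid in cluster_ids]


if __name__ == "__main__":
    # Cluster 0: a gamble that pays 10 or 0. Cluster 1: a safe choice paying 4.
    returns = [10.0, 0.0, 4.0, 4.0]
    clusters = [0, 0, 1, 1]
    print(relabel_with_cluster_average(returns, clusters))
```

After relabeling, the lucky gamble no longer carries a return of 10, so a model conditioned on a target of 10 has no trajectory that makes the gamble look like a reliable way to reach it.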